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Abstract —This paper introduces RankMap, a platform-aware 
end-to-end framework for efficient execution of a broad class 
of iterative learning algorithms for massive and dense datasets. 
Our framework exploits data structure to scalably factorize it 
into an ensemble of lower rank subspaces. The factorization 
creates sparse low-dimensional representations of the data, a 
property which is leveraged to devise effective mapping and 
scheduling of iterative learning algorithms on the distributed 
computing machines. We provide two APIs, one matrix-based 
and one graph-based, which facilitate automated adoption of 
the framework for performing several contemporary learning 
applications. To demonstrate the utility of RankMap, we solve 
sparse recovery and power iteration problems on various real- 
world datasets with up to 1.8 billion non-zeros. Our evaluations 
are performed on Amazon EC2 and IBM iDataPlex servers using 
up to 244 cores. The results demonstrate up to two orders of 
magnitude improvements in memory usage, execution speed, and 
bandwidth compared with the best reported prior work, while 
achieving the same level of learning accuracy. 

Index Terms —Dense and Big Data, Large-Scale Distributed 
Computing, Iterative Machine Learning, Subspace Lactorization. 

I. Introduction 

Many modern learning algorithms are based on explor¬ 
ing the underlying patterns, correlations, and dependencies 
present across the signals in the dataset. Some prominent 
examples of such algorithms and their applications are linear 
or penalized regression ll29l . power iterations BOl . belief 
propagation ll48ll . and expectation maximization lIMI . 133. 
In all of these settings, solving the underlying objective 
function requires iterative updates of parameters of interest 
until convergence is achieved. Such iterative updates often 
require matrix multiplications that involve the data dependency 
or Gram matrix. In scenarios where data is too large to fit 
on a single computing node and must be distributed, iterative 
dependency-based updates become challenging as they incur 
large computation and communication costs. 

To facilitate parallel computing, a number of distributed 
abstractions that target iterative learning algorithms have been 
developed, e.g.. Pregel iTSll . Spark ifSOl . and GraphLab 1341 . 
These abstractions adopt a graph-parallel model which consists 
of a sparse graph and a kernel function that runs in parallel 
on each vertex l25ll . Performance gains are achieved due to 
the communication-minimizing partitioning of the graph and 
effective control of data movement. 

While graph-parallelism has been shown to accelerate ma¬ 
chine learning and signal processing tasks for sparse graphs. 


this approach cannot be readily applied when the data exhibit 
a non-sparse dependency matrix. The storage of such data 
in a graph format becomes very inefficient as it requires 
storing a large number of edges (pairwise non-zero correlation 
values) for each vertex (data sample). In addition, finding 
efficient graph cuts and partitions is infeasible when dense 
dependencies exist. Data with dense dependencies appear in a 
wide range of fields such as computer vision, medical image 
processing, boundary element methods and their applications, 
and N-body problems m, Ea. Thus, finding efficient so¬ 
lutions for running iterative learning algorithms on densely 
dependent data is of paramount importance. 

In this paper, we introduce RankMap, a novel distributed 
framework for efficient execution of a broad class of iterative 
learning algorithms on datasets with dense but structured 
dependencies. Our key observation is that, despite the apparent 
high dimensionality of data, in many settings, dense datasets 
are low rank or lie on a union of much lower dimensional 
subspaces. We exploit this property to reduce the overhead 
associated with processing dense data dependencies—a factor 
which has rendered the currently available graph-parallel ab¬ 
stractions impractical for processing dense datasets. RankMap 
provides a set of interfaces and transformations that enable 
efficient data-aware content analysis, as well as coordinated 
mapping and optimization to the specifics of the underlying 
hardware components. RankMap significantly improves the 
runtime and energy consumption of the learning algorithms 
by reducing the amount of required computation, distributed 
system communication, and storage. 

To accelerate large matrix multiplications required to com¬ 
pute an iterative update, we decompose dense but structured 
data and rewrite it as a product of two matrices with far 
fewer non-zeros than the original data. The decomposed data 
is then used in subsequent iterative learning algorithms in lieu 
of the original dense data. We introduce a host of automated 
methods for partitioning the decomposed factors and ordering 
the computation flow in a distributed setting. The partitioning 
algorithm is efficient (within a bound from the optimum) and 
has a constant runtime. We introduce two different representa¬ 
tions and accompanying computational models (a matrix-based 
and a vertex-centric model) to compute an update. Depending 
on the data domain and the sparsity of the decomposed 
components, there are different regimes where each of these 
two models deliver highest efficiency. 

We provide APIs for both matrix-based and vertex-centric 
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iterative update models on the transformed data. Our APIs 
are open-source and available at 0. Our matrix-based API 
uses the general Message Passing Interface (MPI). Our vertex¬ 
centric API is based upon the GraphLab programming model. 
We develop an efficient mapping of the iterative computations 
on the sparsified decomposed data within the constraints of 
the GraphLab distributed framework. Both APIs are written 
in C-H-. We evaluate RankMap on the Amazon Elastic Cloud 
(EC2) computing service and IBM iDataPlex computer cluster. 
Our experiments utilize up to 244 cores on 12 large computing 
nodes. 

Our explicit contributions are as follows: 


• We propose RankMap, a large-scale learning framework 
that proposes sparse transformations for accelerating it¬ 
erative learning algorithms on dense but structured data. 

• We introduce a scalable transformation which maps struc¬ 
tured (low-rank) data onto two matrices which contain far 
fewer number of non-zeros. A systematic way to tune the 
transformation error to achieve a desired level of accuracy 
in the learning applications is provided. 

• We develop efficient distributed computational models to 
conduct iterative updates on the decomposed data. Highly 
effective partitioning methods for the decomposed data 
along with data-aware performance bounds are provided. 

• We perform proof-of-concept evaluation on applications 
including eigenvalue decomposition, denoising, and clas¬ 
sification that demonstrate up to two orders of magnitude 
improvement in runtime and memory footprint. 


This paper is organized as follows. In SectionITO we provide 
a global overview of RankMap. In Section |III| we review 
related work. In Section |IV] we introduce our novel data 
transformation algorithm. In Section [V] we study the impact 
of the decomposition error on learning and provide a method 
for automatically tuning RankMap to produce a user-specified 
learning error. In Section VI we introduce schemes for cost- 
efficient distributed partitioning, along with the details of 
our graph and matrix-based computational models. In Section 
VII we provide evaluation results on multiple synthetic and 


real-world datasets. Einally, in Section VIII we discuss the 
practicality of our framework, describe domain-specific use- 
cases of each of the proposed computational models, and 
conclude. 


II. RankMap Eramework 


A D V 



Eig. 1: Schematic of decomposing a dense data matrix into the 
product of a small dense matrix and a large sparse matrix. 


vector, denoted by x, according to an update function of the 
following form: 

^iter+l ^ ( 1 ) 


where /(•) is a low-complexity function and iter is the current 


iteration. Examples are included in Section II-B 


When G is massive and dense, each distributed update in 
0 becomes very expensive. To cope with large data sizes, 
RankMap creates an approximation to G, denoted by G, to 
reduce the cost of an update. To be more specific, our aim is 
to decompose the data matrix A into two components, i.e., 
A = DV, where D G contains a subset of columns 

from A, V € is a sparse matrix, and rank(A) < I 

n (see Eigure [^1. After decomposing A, we then efficiently 
partition the decomposed data and perform distributed updates 
using G = (DV)^(DV). Mapping the original dense data 
to a decomposed model directly reduces the memory usage, 
and the costly computational operations and communication 
incurred by the iterative updates. 

When A is low-rank, it is possible to construct a re¬ 
duced decomposition that is exact (A = DV). However, 
we demonstrate that for many real-world datasets, we can 
achieve significant performance improvements in exchange 
for a small decomposition error (A « DV). We discuss the 
connection between the decomposition error and the accuracy 
of a target learning method as well as strategies for tuning the 
decomposition to achieve a desired level of accuracy in the 
iterative learning algorithms. 

RankMap consists of three main components (see Eigure 
1^: (i) A scalable data decomposition that shrinks the size 
of the data set by leveraging the data’s structure, (ii) A 
data partitioning scheme along with an execution flow for 
performing iterative updates on the decomposed data that 
significantly reduces the distributed computing costs, (iii) A 
systematic method for tuning the decomposition error (denoted 
by 5d) to achieve the desired level of approximation error in 
the learning algorithms (denoted by 5l). 


A. Overview and approach 

In this paper, we introduce RankMap, a distributed data- 
aware framework that efficiently executes learning algorithms 
applied to the data. The main idea underlying our approach is 
to leverage structure in large collections of data to decompose 
the correlation (Gram) matrix of the data such that the system 
costs (e.g, runtime, memory, and energy) associated with 
iterative learning algorithms are significantly reduced. 

Let the data matrix A G denote a collection of n 

signals of m-dimensions, and G = A^A denote the Gram 
matrix. Many learning algorithms iteratively update a solution 


B. Target applications 

Our framework can be used for a broad class of optimization 
problems that are solved via iterative updates based upon the 
Gram matrix. A large number of objective functions used 
in machine learning, e.g., penalized regression methods such 
as the LASSO or BPDN 0, and ridge regression ||29]| . are 
typically solved using iterative updates. In all these settings, 
the complexity of executing these methods is dominated by 
costly iterative computation on the Gram matrix of the dataset. 

To ground RankMap in real-world problems, we now 
discuss two particular learning algorithms that are evaluated 
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Fig. 2; An overview of RankMap framework. The method is divided into two main phases, the decomposition phase (left) 
and the execution phase (right). To execute iterative updates, we provide two computational models: a matrix-based model 
(implemented in MPI) and a graph-based model (implemented in GraphLab). 


in this paper: (i) sparse approximation and (ii) the power 
method for eigenvalue decomposition. 

(i) Sparse approximation for image denoising and clas¬ 
sification. Sparse representation is used in a wide range of 
signal processing and machine learning applications, including 
denoising ID, classification in, clustering M, and outlier 
detection ini. The sparse approximation objective function 
can be written in terms of the ^i-norm as follows: 

argmin ||Ax-y ||2 + A|ix||i, (2) 

X 

where x is a sparse representation of y with respect to A 
and A is a regularization coefficient (increasing this parameter 
promotes sparser solutions). 

In an image denoising application, y is a noisy image, x is a 
sparse coefficient vector, and Ax is a denoised approximation 
of y. In a classification application, the sparse coefficient 
vector X is used to determine which class a test signal y 
belongs to. This can be done by first measuring the sum of 
the coefficients in each class and then finding the class that 
has largest number of nonzero coefficient. 

In the sequel, we evaluate the performance of RankMap 
for accelerating first-order methods for sparse approximation 
via -minimization. This sparse approximation problem can 
be solved using the following iterative soft thresholding (1ST) 
algorithm HD: 

^iter+l ^ _ A^y)), (3) 

where /(.) is a low-complexity thresholding operation (e.g., a 
soft-thresholding operator HD) to account for the term A||x||i 
at each iteration, and 7 is the step size. In our evaluations, we 
employ a variant of this algorithm called FISTA 16]. FISTA 
is an example of a projected gradient descent (PGD) methods 
which provide a generalization of standard gradient 
descent methods for certain classes of non-smooth objective 
functions. RankMap can be readily applied to cost functions 


that can be solved using PGD. 


(ii) Power method for eigenvalue decomposition. The power 
method is a simple and iterative algorithm that can be used to 
sequentially find the eigenvectors and eigenvalues of a matrix 
in descending order. Recall that an eigenvector x of a matrix 
A satisfies the following relationship Ax = ux, where a is 
the eigenvalue associated with the eigenvector x. To find an 
eigenvector of the symmetric matrix G = A^A, the power 
method utilizes the following iterative update: 




Q^iter 


(4) 


Once the power method converges to an estimate of an 
eigenvector x, the contribution of this eigenvector is removed 
from A, and the power method is applied again to the residual 
to find the next eigenvector. 

In both applications described above, the main cost of each 
iteration is due to the computation of Gx, especially when G 
is large, dense, and distributed onto multiple computing nodes. 
For example, as a case-study in our evaluations, we perform 
image reconstruction on a dataset where A is a collection of 
light field image patches of size 18,496 x 100, 000. In this case, 
to reconstruct a single noisy image patch y, more than 3.6 
billion floating point multiplications are required to perform 
Gx**®’' = A^Ax**®’' per iteration. 


III. Background and Related Work 
In this section, we provide background on methods for 
matrix factorization and describe related work. 


A. Methods for matrix factorization 

High-dimensional data can be modeled by the low rank 
structures that are present in the data. Extracting low dimen¬ 
sional structures not only reduces dimensionality, but also 
mitigates the effect of noise and improves the performance 
of learning and inference tasks HD, HD- 
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1) Singular value decomposition (SVD): In settings where 
the column span of A admits a low rank model, the SVD 
provides a powerful tool for forming low rank approximations. 
Let A = USV^ be the SVD. The best rank-fc approximation 
of A is given by Afc = U/cS^V^, where Ufc G ]^™xfc 

Vfe G are the truncated left and right singular vectors 

(first k columns of U and V) and G contains the 

first k singular values of A along its diagonal. The rank of 
Afc equals the number of non-zero singular values. The trun¬ 
cated SVD also provides the solution to principal components 
analysis (PCA), which seeks to find a fc-dimensional subspace 
that best approximates A in the least-squares sense ll46l . 

The complexity of computing the SVD directly is rri^n. 
Thus, for large datasets, the power method is used to find 
the eigenvectors of A^A and AA^, which correspond to the 
right and left singular vectors respectively. 

2 ) Sparse factorization: The SVD provides a closed-form 
solution for finding the best rank-k approximation to a matrix. 
However, in many settings, enforcing sparse structure, either 
in the left or right singular vectors can provide a more faithful 
and compact decomposition of the data. Two widely used 
sparse factorization methods include sparse PCA (SPCA) asa 
and dictionary learning (DL) a. However, these approaches 
are often not applied to large datasets since computing an 
update of both the left and right factor matrices, at each iter¬ 
ation is costly. To solve SPCA on big datasets, a generalized 
power method can be employed l30l. The basic idea behind 
using the power method to find sparse principal components 
is to simply threshold the output of each power iteration to 
ensure the resulting eigenvectors are sparse. Unfortunately, 
the convergence of this method is much slower than standard 
power iterations. 

3) Column subset selection (CSS)-based matrix factoriza¬ 
tion: An alternative strategy for low rank matrix factorization 
is to form a decomposition based upon columns (or rows) from 
the data. CSS-based solutions form an approximate matrix 
decomposition in which one factorized component is a subset 
of the columns of the data itself ifThll . CSS-based approaches 
have been used to provide a scalable and efficient strategy 
for finding approximate solutions to least-squares regression 
ca, Gramian matrix decomposition El, image denoising 
and clustering 113, and also in spectral clustering EH. After 
selecting columns from A, the remaining unsampled columns 
are completed by finding the least-squares projection onto the 
subspace spanned by the sampled columns. 

B. Generic distributed abstractions 


sparse datasets, efficient data partitioning is not possible when 
the graph-representation of the data is densely connected. 
Furthermore, such tools mostly rely on the communication 
between the vertices for computation. When the data is densely 
connected, the resulting communication congestion makes the 
computation dramatically slow. Because of this, most of these 
tools are designed based on the assumption that the input data 
is sparse ll25l . 1351, l49l . 

By design, MapReduce-based solutions are not guaranteed 
to be fast, instead they provide easy and reusable programming 
frameworks that operate on very large datasets on a distributed 
computing platform. Users only have to deal with writing 
the functions of the algorithm in the given MapReduce-based 
programming model. MapReduce, on the other hand, controls 
the distributed cluster, manages data partitioning and data 
transfers between the various parts of the system, and provides 
fault tolerance. 

IV. Column selection-based sparse decomposition 
(CSSD) 

In this section, we present a scalable method for matrix de¬ 
composition (the Decomposition phase in Figure which we 
call Column Selection-Based Sparse Decomposition (CSSD). 

A. Overview of CSSD method 

The main idea behind CSSD is to first select a subset of 
columns from A, and then use this subset of columns as a basis 
from which we form sparse representations of the remaining 
columns. We thus factorize the data as A = DV, where 
D is formed by subsampling and normalizing the selected 
columns of A. Each column of V is then computed by finding 
the sparse approximation of the corresponding column of A 
with respect to D. This sparse approximation problem can 
be solved by an efficient greedy routine called orthogonal 
matching pursuit (OMP) ca. We provide pseudocode for 
CSSD in Algorithmic 

1) Step 1. Sequential column selection: In order to ensure 
that the total approximation error in our factorization is 
sufficiently small, we must ensure that the columns selected 
from A to form D well approximate the range of the original 
matrix. Thus, we employ a sequential method to adaptively 
select columns that are not well approximated by the current 
set of columns El. 

Adaptive column selection methods select a new batch of 
columns according to the following probability distribution; 


A number of successful distributed abstractions for process¬ 
ing large datasets on clusters have been proposed. Examples 
include MapReduce lfT3l . Apache Spark Il49l . and SystemML 
1231 . However, these models become less efficient for applica¬ 
tions when direct data-parallelism does not exist. Several new 
distributed abstractions have been proposed that model data 
dependency in a graph format, most notably Pregel llTSl and 
GraphLab ll34l . They use a vertex-centric computation model, 
in which the user-defined programs are executed on each 
vertex in parallel. As graph-based abstractions are suited for 


p{i) oc 


||AgA|a, -a^lla 


(5) 


where p{i) equals the probability of selecting the column 
from A (denoted by a^), S contains the indices of columns 
already selected, and As denotes the sub-matrix of sampled 
columns. We can flexibly execute this subsampling approach 
by either specifying the maximum number of columns to 
select and/or specifying the maximum amount of error in each 
unsampled column of A. 
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Algorithm 1 : Column Selection Sparse Matrix Decompo¬ 
sition_ 

Input: Matrix A € error tolerance S]j, number of 

columns to select at each iteration Ig, and the maximum 
number of columns to select 1. 

Output: A sparse matrix V G and a dense matrix 

D G such that for each column of a of A, ||a — 

Dv|| 2 < Sn- 

Initialize: Initialize D by adding and normalizing Ig 
columns from A with uniform random sampling. 

Step 1: Sequential column selection 
while ncoIs(D) < I do 

I. Update D by selecting and normalizing Ig columns 
from A according to the distribution in Q. 

II. If the £ 2 -norm of each column of E = A — DD+A 
is less than Sd, return D and proceed to Step 2 to 
compute V. 

end while 

Step 2. Sparse approximation 

I. Compute V by applying Batch OMP to solve (|^ 
with error tolerance Sd- 


2) Step 2. Sparse approximation: After selecting a subset 
of I columns As G we normalize each column such 

that all the columns in matrix D have unit norm. Now, to 
form the sparse matrix V, we find a sparse representation 
of the remaining columns in A (i.e., A_s) in terms of the 
normalized dictionary D. The problem is formally written as 
follows: 

min ||v||o s.t. < Sd, Vi ^ S. (6) 

where ||v||o counts the number of nonzero coefficients in 
V and Jd is a user-specified parameter which controls the 
decomposition error. 

We employ a matching pursuit-based solver called Batch 
OMP ll43l to solve (|^. We can enforce sparsity either by the 
number of non-zeros in each column of V (i.e., ||v||o) or by 
the total amount of approximation error for each column. 

B. Complexity analysis 

The complexity of sequential column selection (Step 1) is 
0{Pm -f Imn). The complexity terms correspond to com¬ 
puting D+ and DD+A respectively. The projection DD+a 
can be computed for each column of A independently. The 
complexity of sparse approximation (Step 2), using the Batch 
OMP method ll43ll . is 0{lmn -f It^ln), where k < I is the 
average number of non-zeros per column of V. Similarly, for 
each column of A, Batch OMP is applied independently. Let 
ric be the number of parallel processing nodes. By storing D 
(which is a small m x I matrix) and a uniform fraction of 
columns of A in each node (i.e., — columns), the overall 
complexity of Algorithm 1 in a distributed setting can be 
written as 0{^{lm -f k'^l) -f Pm). 


Note that CSSD is linear in terms of both the number of 
data samples n, and the number of processors ric- This is a key 
feature of our approach that makes our framework applicable 
to very large datasets in distributed settings. 

C. Computational benefits of CSSD 

CSSD provides computational benefits when the size of the 
decomposition is small (i.e., I is small relative to m) and/or 
when matrix V is sparse. In general, predicting the amount 
of savings in computation is a function of (i) the structure of 
the data and (ii) the amount of accuracy required from the 
learning algorithm. We now discuss some key factors that 
impact the decomposition results. 

Impact of data structure. Predicting the size and sparsity 
of the decomposition provided by CSSD for an arbitrary 
dataset is challenging; however, when the data lies on a 
single subspace (i.e., exhibits low rank structure) or lies on 
multiple low-dimensional subspaces, CSSD provides a more 
compact representation of the data. For example, when data 
is exactly low rank and its rank is r < to, we must select 
r linearly independent columns from A to form an exact 
decomposition (zero error), i.e., I = r. When the data is 
approximately low rank, there exists a large body of work 
that characterizes the performance of the sequential column 
selection method (Step 1) used to form D iflTll . Il2^ . In 
particular, the selection strategy in Step 2 of Algorithm [T] 
provides exponential decrease in the factorization error with 
each batch of columns that we select from A m. More 
specifically, assume that at each iteration, we select ^ g 
samples from the columns of A and let I = tig denote the set 
of columns selected after t iterations. Let A^ denote the best 
rank r approximation to A and let A = AsAgA denote the 
approximation of A based upon the I selected columns Ag. 
Then according to m, the difference between the expected 
value of the approximation error, i.e., || A—A|||. and that of the 
best rank r approximation || A—Aj.|||. decreases exponentially 
with rate e*. 

Another low-dimensional signal model that has recently 
gained traction models data with multiple low-dimensional 
subspaces (union of subspaces). For example, images of 
objects under different illumination conditions BTI . motion 
trajectories of point-correspondences EH, neural data ini, 
to structured sparse and block-sparse signals Q are all 
well-approximated by a union of low-dimensional subspaces. 
When A lies on a union of subspaces, this effectively bounds 
the sparsity level of each column of V ifTSl . This insight is 
based upon the fact that when we form a representation of 
a column of A with respect to other columns in the same 
dataset (as in CSSD), the sparsity level of each column is 
bounded by the dimension of the subspace the signal lies on. 
For instance, if A lives on a union of multiple r-dimensional 
subspaces of K" and we select at least r linearly independent 
columns from each subspace, then no more than r non-zeros 
are required to represent a signal. In other words, the number 
of non-zeros per column of V is no more the dimension of 
the subspace the signal lives in. 
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Impact of increasing the decomposition error. The decom¬ 
position error of CSSD is controlled by the parameter 6 d 
in Algorithm In the case where we set 6 d = 0, then 
we are guaranteed an exact decomposition of the data. Exact 
decomposition occurs when r linearly independent columns 
are selected from A, where r — rank(A). In this case, the 
selected columns in A 5 will fully represent the data and thus 
||A — AsAJA|| = 0, i.e., exact decomposition occurs. 

While CSSD can produce an exact decomposition when the 
data is exactly low rank (or lies on a union of subspaces), in 
practice, datasets are approximately low rank. In this case, we 
can introduce a small amount of error into the decomposition 
by setting 5d > 0. By introducing some error into the 
decomposition, we observe that both the number of selected 
columns in Step 1 of Algorithm and the sparsity level 
of V can be reduced further. In Figure we show how 
increasing the decomposition error produces a more compact 
decomposition. In Section]^ we study and discuss the impact 
of the decomposition error on the accuracy of a learning 
algorithms that we apply RankMap to. 


error is small enough to ensure that 5l ts within the error- 
specified tolerance. Depending on the underlying computing 
resources available, RankMap can be applied for multiple 
values of 5o in parallel and the largest value of (most 
compact representation) that achieves a particular value of 
is selected. 

When computing resources are constrained and thus running 
the algorithm for multiple values of 5o in parallel is not 
possible, we use a bisection method. In essence, the idea is 
to; (i) set the factorization error to predefined maximum value 
^max ^ jjj experiments) and evaluate 5l, (ii) if 5l is 
below a target value then we stop, otherwise we decrease 
5d by half. By exponentially decreasing Sdi we are also 
guaranteed to decrease 5l exponentially, provided that there is 
a polynomial relationship between the two quantities. We ob¬ 
serve a polynomial relationship holds both in theory cni and 
in practice. In Section VII we provide empirical results which 
demonstrate the connection between the decomposition error 
and learning accuracy for numerous datasets and algorithms 
of interest (see Figures 1^ and [ 


V. Tuning decomposition error for a target 

LEARNING ACCURACY 

In the previous section, we discussed the computational 
benefits associated with introducing some approximation error 
into a CSSD decomposition. Naturally, as we increase the 
decomposition error (controlled by 80)7 the accuracy of our 
learning algorithm will be affected. Thus, the key question 
is how much decomposition error we can afford to achieve 
a certain degree of accuracy in learning. The answer to this 
question heavily depends on the specific learning algorithm 
and the application of interest. 

Previous theoretical studies have established a connection 
between the total error in a factorization of a kernel (or 
Gram) matrix and the accuracy of certain popular learning 
algorithms, including: kernel ridge regression and kernel SVM 
CQl. While for some learning algorithms, our framework 
can exploit previous work to relate 5d and the learning 
error which we denote by 5l, the aim of this section is to 
propose a generic approach for tuning the factorization error to 
achieve a specified learning accuracy. We do this by iteratively 
remapping of the data to find a compact decomposition that 
satisfies a learning error (specified by the user). 

A. Error tuning 

Given an already established relationship between the de¬ 
composition error and a specific algorithm, a practitioner who 
uses our framework can easily specify the error parameter 5d 
for CSSD to achieve a particular learning accuracy. Alterna¬ 
tively, if a practitioner specifies a target accuracy for a learning 
algorithm, the decomposition error Su can be tuned in order 
to achieve a particular learning error 5l- 

Our strategy for guaranteeing that we have small learning 
error, is to solve CSSD for a particular 5d, map the resulting 
decomposition via the methods described in Section |Vl] and 
then compute the accuracy of the learning algorithm 5l- We 
then iteratively add columns to D such that the decomposition 


VI. Distributed execution and data partitioning 

In this section, we introduce our approach for applying 
iterative updates on the decomposed data (the execution phase 
in Figure |^. We describe an execution flow for dependency- 
matrix based updates (i.e., Gx = V^(D^D)Vx) and intro¬ 
duce an efficient method for partitioning the decomposed data 
in a distributed setting. We also provide performance bounds 
on memory usage, the number of flop operations, and the 
number of communicated bytes across the computing nodes. 

A. Computation flow 

We propose two computational models for the distributed 
implementation of an update in Q. Recall that at each 
iteration, we must compute z = Gx = V^(D^D)Vx. 
We break this computation into four steps: (i) p = Vx (ii) 
r = Dp, (iii) p = D^r, and (iv) z = V^p. The output vector 
z is used to produce an update of = /(z + b), where 

b is an offset vector, and /(•) is a low-complexity function 
such as a soft-thresholding operator (sparse approximation) or 
normalization (power method). To carry out the computation 
described above, we propose and implement a matrix-based 
and vertex-based model to apply the iterative updates on the 
decomposed factors. We now describe our implementation of 
both models. 

B. Matrix-based model 

Figure shows the schematic of our proposed matrix- 
based model. In this model, the data is stored in arrays. 
The sparse matrix V is stored and operated upon using the 
Compressed Sparse Column (CSC) format. The matrix D and 
vector X are stored using regular dense arrays. By doing so, 
we exploit sparsity in V. We use C-n- Eigen Library for array 
manipulation and MPI for distributed computing. 
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r = Dp 

T 

2 

p = Vx 

t 



i 

p = D'''r 


x = VTp 


Fig. 3: Distributed design of matrix-based model. 



Fig. 4: Distributed design of graph-based model. 


1) Distributed partitioning: We partition columns of V 
uniformly across the computing nodes to achieve a balanced 
partitioning. Let us assume that there are ric computing nodes. 
Thus, — number of columns are assigned to each node. The 
vector X is also divided into chunks of size — x 1. Each 

Tic 

chunk is then allocated to the node that hosts the corresponding 
columns of V. 

Matrix-vector multiplications Vx are performed locally on 
the columns of V and the portion of x that reside on the 
same computing node. The resulting I x 1 vectors are then 
sent to a central node to create p = Vx. Next, D^(Dp) is 
computed locally in the central node. The resulting I x 1 vector 
is then broadcasted back to all the computing nodes where it 
is multiplied by the local to update the vector x. 

2) Performance bounds: We now provide bounds on the 
memory usage, computation, and communication required by 
our proposed matrix-based model. We also provide the per¬ 
formance bounds for baseline, in which we perform iterative 
analysis on A^A in matrix format. Let nnz(.) denote the 
number of non-zeros of its input and denote the number 
of computing nodes. Recall that D is a m x Z matrix and 
V is a Z X n matrix. Note that in our target data scenarios 
m n and Z ^ n. In many cases, the rank of decomposition 
I is often much smaller than the dimensions of data m. When 
data exhibits union of subspace property, V will be sparse, 
i.e., nnz{y) < In < mn. 


Matrix-based 

Baseline 

RankMap 

Memory usage oc: 

# non-zeros 

mn 

(nnz()V) Im) -\-n-\-m 

Computation ex : 

# additions 
# multiplications 

2mn -\- rnric 
2mn 

2{nnz(V) lm-\- InSj 
2(nnz{y) + Im) 

Communication ex: 
#edges 

2lm 

2lnc 


Since V is stored in a CSC format, only the non-zero 
values are stored and operated on. The matrix D is stored in a 
regular dense matrix format. The communication corresponds 
to sending and receiving the I x 1 vectors from each computing 
node to the central node.Clearly, for smaller I and sparser V, 
both memory footprint and the number of arithmetic opera¬ 
tions are reduced. The number of edges, which correspond 
to the number of broadcasted and reduced values, directly 
corresponds to I and the number of computing nodes n^. 


C. Graph-based model 

Figure shows an schematic of our proposed graph 
model. The decomposed data is three-layer graph denoted 
by Ga{Sx,Sp,Sr) with vertex sets Sx = in 

the bottom layer, Sp = in the middle layer, and 

Sr — {f?i} in the top layer. Each non-zero element in V, 
e.g., Vij, is represented by an edge which connects Xi to Pj. 
Each column of D, e.g., D^, is represented by an edge which 
connects Pi to Ri. Value of vertices in Sx correspond to the 
elements of x. 

We use GraphLab Distributed API 041 to implement this 
model. While GraphLab is a highly optimized distributed 
engine for Graph-based computation on iterative data, we 
perform extensive customizations in order to adapt GraphLab 
to our factorized setting. We also force GraphLab to use 
our developed graph partitioning method as opposed to its 
automated partitioning schemes. Our proposed partitioning is 
customized to the factorized data and significantly improves 
the performance. 

1) Distributed partitioning: In the graph-based model, we 
partition Qa{Sx: Sp, Sr) with the aim of balancing the num¬ 
ber of components assigned to each node and also minimizing 
the inter-node communications characterized by the edges. 
Since the edge distribution of is highly non-uniform 
(I ^ n), a vertex partitioning inevitably results in many 
undesirable edge-cuts across the computing nodes. Instead, we 
apply a vertex-cut method in which the goal is to partition 
graph edges evenly such that the number of vertices that are 
spanned across multiple partitions is minimized. As a result 
of edge partitioning, vertices may reside on two or more 
computing nodes. In this case, we assign one of the copies to 
be the master vertex and the others to be the replica vertices 
(these definitions are borrowed from GraphLab ll25l L The 
replicas directly cause (expensive) inter-node communication 
costs. 

Eigure shows the graph-based distributed design. Our 
detailed edge partitioning method is as follows, (i) Distribute 
master of vertices Xi G Sx uniformly onto the available 
computing nodes such that vertex chunks of size are 
assigned to each node, (ii) Add the edges between vertices 
Xi G Sx and Pj G Sp to the node in which the corresponding 
master of Xi resides, (iii) Add master of vertices Pi G Sp and 
Ri G Sr to a central node, (iv) Add the edges between the 
vertices Pi G Sp and Ri G Sr to the central node. 

The proposed edge partitioning algorithm is highly efficient 
in that it does not induce any replicas for vertices in Sx 































and Sr. However, from Step (ii), replicas of vertices in Sp 
may exist in computing nodes other than the central node. 
At the beginning of an iteration, master vertices in Sp and 
their replicas perform vertex updates with respect to Sx- The 
replicas send the updated values to their own master vertices in 
the central node. The master vertices in S'p reduce the received 
values (p = Vx). Then master vertex i?i performs a vertex 
update (r = Dp — y). Next master vertices in Sp complete 
vertex updates with respect to Sr and broadcast the results to 
their own replicas (p = D^r). Finally, master vertices in Sx 
update themselves (x = V^p). We integrate and implement 
the proposed customized partitioning and distributed compu¬ 
tation flow with the distributed GraphLab API lf25l . 

2) Performance bounds: We now provide bounds on the 
memory usage, computation, and communication required by 
our proposed graph-based model. 

• Memory usage 

# edges cx nnz(V) + 1. 

# vertices cx n -f trep{Pi). 

• Computation (per iteration) 

# additions cx 2(nnz(V) -f ml) + J2i<i<i 'fe.p[Pi). 

# multiplications cx 2{nnz(V) + ml). 

• Communication 

# edge-cuts cx 2 rep{Pi). 


Graph-based 

Baseline 

RankMap 

Memory usage (x: 

# edges 
# vertices 

mn 

n + mric 

nnz{V) + 1 
" + Ei<i<! rep(Pi) 

Computation x : 

# additions 
# multiplications 

2mn -f mric 
2mn 

2(nnz(Y) + ml) + Yii<i<i rep(Pi) 
2(nnz(V) -f Im) 

Communication x: 
#cuts 

2lm 

2Ei<i<irep(Pi) 


Each of the computing nodes receive approximately ^(n-f 
Si<i<i vertices and ^nnz(V) edges. The central 

node has I additional edges between the master vertices in Sp 
and i?i. The computation cost is induced by vertex update 
operations. The communication overhead is incurred by the 
message passing across master and replica vertices in Sp. 

Bound on total replicas. From the discussions above, it 
is clear that reducing number of replicas of Sp reduces the 
communication overhead. The following are the bounds on the 
total number of replicas: 

I < ^ rep{Pi) < Iric. 

l<i<l 

The inequalities hold since each Pi is replicated at least 
once and at most ric times (one replica per computing 
node). Both I and ric are much smaller than the size of the 
graph. Thus, RankMap’s graph-based model readily provides 
efficient/balanced computation and reduced communication 
without using complicated and costly graph partitioning al¬ 
gorithms. The minimum communication is achieved when V 
is block-diagonal. 

VII. Evaluations 

In this section, we evaluate the performance of RankMap 
on a variety of datasets. Our evaluations explore: (i) the 


scalability of CSSD and its ability to produce sparse represen¬ 
tations, (ii) the connection between decomposition error and 
learning accuracy for multiple learning applications including 
face recognition, image denoising, and PCA, (iii) RankMap’s 
performance improvement in terms of runtime, and memory 
over prior work, and (iv) the performance of our distributed 
matrix- and graph-based models for different structured data 
sets. 

A. Evaluation setup 

1) Datasets: We evaluate RankMap on both real and struc¬ 
tured synthetic datasets. Our real datasets include Light Field 
data El, hyper spectral images m, a dictionary of video 
frames ll28l . and a collection of images of different faces under 
varying illumination conditions ll22l . 

We apply RankMap to two different Light Field datasets. 
The first dataset, which we refer to as Light Field (i), consists 
of lOfc randomly selected atoms from a 5 x 5 Light Field array 
(collected from Chess Images). Each Light Eield patch consists 
of 25 8 X 8 -patches which produces a dataset of size 1.6k x lOA: 
(128MB). The second dataset, which we refer to as Light 
Eield (ii), consists of lOOfc randomly selected atoms from a 
17 X 17 Light Eield array (collected from all available Light 
Eields in the archive). Each Light Eield patch consists of 289 
8 X 8 -patches which produces a dataset of size 18496 x lOO/c 
(14:.7GB). The hyper spectral dataset (Salinas) is taken from 
a region of a remote sensing scene in Salinas, CA. Each pixel 
in the scene contains information from 203 spectral bands and 
produces a dataset of size 203 x 54129 (87.9MB). The video 
dictionary dataset (VideoDict) contains patches of an image 
over multiple frames and produces a dataset of size 1764 x 
100000 (1.4AGB). The face image dataset (Eaces) consists 
of 631 images of 10 different peoples faces under varying 
illumination conditions. Each image is 48 x 84 pixels, which 
produces a dataset of size 4032 x 631. In addition to real-world 
datasets, we generate synthetic data for n = lOM, m = Ik 
with varying I and sparsity levels in V. 

2) Computing platform: To evaluate the decomposition 
methods on Light Field (i) an 8 -core CPU (Intel Core™i7 
processor) with 12GB of RAM is used. For computations on 
Light Field dataset (ii), we instanciated a cluster of 16 m3 .large 
nodes (machines) on Amazon EC2. Each node has 16 cores 
(two Intel Xeon processors) at 7.5GB of RAM per node. The 
synthetic datasets are evaluated on IBM iDataPIex computing 
cluster which has 2304 cores in 192 Westmere nodes (12 
processor cores per node) at 48GB of RAM per node. 

3) Distributed tools: All RankMap’s APIs are available to 
the public E]. The RankMap framework’s sparse matrix-based 
model is implemented using Eigen library to represent data in a 
compressed column storage (CCS) format lIZTl . It uses MPI’s 
standard system to distribute the data and computation and 
is written in C-H-. We have also implemented the distributed 
update on the factorized data on Apache Spark ll49ll . 

The RankMap framework’s sparse graph-based design is 
implemented using GraphLab, a high-level graph-parallel ab¬ 
straction i^ . GraphLab enables vertex-update-based compu¬ 
tations. We implemented RankMap’s customized partitioning 
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using Graphlab’s ingress class. The proposed architectures 
are mapped efficiently into GraphLab API (Section VI-Ci. 
Note that the GraphLab framework is designed to accelerate 
distributed learning for sparse graphs and thus it is not suited 
to process dense data until we sparsify the data using CSSD. 


Test Image 



B. Scaling of CSSD 

Figure shows how the runtime of CSSD scales as the 
number of processors increases for the VideoDict dataset. We 
increase the number of cores from 4 to 256 (on IBM iDataPIex 
cluster). The dotted line shows the ideal scale-out behavior. 
As can be seen, CSSD is highly parallel as it almost linearly 
scales with the number of processors. Thus, it can be applied 
to very large datasets. 



Fig. 5: CSSD’s runtime scaling behavior as the number of 
processors increases. 


C. Sparse approximation 

To evaluate the performance of RankMap for sparse ap¬ 
proximation, we use the fast iterative shrinkage-thresholding 
algorithm (FISTA) i) to solve the -minimization problem 
in We study the utility of RankMap for two applications: 
sparse representation-based classification for face recognition 
and image denoising (see Section II-B for more details on 
these applications). 

1) Sparse representation-based classification for face 
recognition: To employ sparse approximation for classifica¬ 
tion, our aim is to use a collection of labeled images (training 
set) as our dictionary A and then form a sparse representation 
of a test image y in terms of A. After finding a sparse 
coefficient vector x, we can then determine which signals in 
the testing set (columns of A) are selected to represent the 
test signal y. Based upon the class of the selected columns, 
we then make a decision about which class the test signal lies 
in. One easy way to do this is to simply sum the absolute 
value of the coefficients in x in a certain class and then find 
the class with maximum sum. 

In Figure we provide a demonstration of sparse 
representation-based classification for face recognition. We 
show the test image of interest on top and the corresponding 
sparse coefficient vector obtained by solving 0 with FISTA, 
where A = 1. We solve FISTA with the full Gram matrix 
A^A and approximate Gram provided by CSSD, where the 
decomposition error Sjy = 0.05 {I = 62). 

To understand the connection between the decomposition 
error and learning accuracy for face recognition, we solve 


in 

C 0.025 

.2 

O 0.02 

£ 

O 0.015 

o 
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5 0.005 
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Fig. 6: Sparse representation-based face recognition. We show 
the sparse coefficients obtained with the original Gram matrix 
(blue, solid) and approximate Gram matrix with CSSD for 
Sd = 0.05 (red, dash). On top, we show the test image, 
and four training images that produce significant non-zero 
coefficients. Both the baseline and our approach result in 
correct classification, as their largest coefficient is associated 
with a training example from the same class as the test image. 
The block of coefficients corresponding to images in the 
correct class is highlighted. 



© using FISTA for two different regularization parameters 
A = {0,1}, where A = 0 corresponds to the least-squares 
solution and A = 1 produces sparse solutions. We vary 
the decomposition error S]j = {0.4,0.2,0.1,0.05} and solve 
FISTA for 30 different test images (after removing them 
from training) for each of these decompositions. We calculate 
the: learning accuracy by measuring the £ 2 -norm between the 
solution obtained with the full and approximate Gram (Figure 
7a I, the sum of coefficients in the correct class (Figure [7b)l, 


and the relative density of V (the number of non-zeros in V 
versus the number of non-zeros in A) (Figure 7c 1 . In Figure 
[7b| we also display the minimum sum of coefficients required 
to correctly classify the test image. In this case, we obtain 
the correct class with this approach when Sd < 0.2. These 
results suggest that even when we allow a significant amount 
of decomposition error, correct classification is still possible. 

2) Light Field image denoising: We evaluate RankMap’s 
performance in denoising Light Field data. A Light Field is 
a multi-dimensional array of images where each image is 
captured from a slightly different viewpoint. To reconstruct 
and denoise light fields, £ 1 -minimization •El is employed to 
find a sparse representation of a Light Field image with respect 
to an overcomplete Light Field dictionary consisting of a large 
number of Light Field image patches collected from many 
scenes ll3^ . This dictionary can be used to reconstruct light 
field patches for the purpose of super-resolution and denoising. 
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Fig. 7; Studying the impact of the decomposition error on learning accuracy. For a range of decompositions (varying 6 d) we 
show (a) learning accuracy which measures the 2-norm between the solution obtained with the full and approximate Gram, 
(b) sum of coefficients in the correct class, the minimum sum of coefficients required to correctly classify the test image is 
also shown, and (c) relative number of non-zeros in V versus the number of non-zeros in A. 


We study the performance of RankMap for reconstructing 
light field patches from noisy observations (image denoising). 
In all of our denoising experiments. Light Field (ii) is used. 
We first apply CSSD for decomposing the dictionary cor¬ 
responding to the Light Field (ii) dataset, for two different 
values of I = 240 and I = 1000. The decomposition error 6 jj 
is set to 0.1. After decomposing the data, we then evaluate 
FISTA on the decomposed data with the matrix-based model. 
We compare RankMap’s performance with that of a tailored 
distributed MPI-based model to evaluate FISTA on the original 
dataset (A) using regular dense matrix representations. This 
implementation is denoted as the baseline in our evaluations. 

TableUshows the total runtime of FISTA to achieve different 
PSNRs. The Peak Signal to Noise Ratio (PSNR) is the 
ratio between the maximum possible power of a signal and 
the power of the corrupting noise, is used to measure the 
performance of denoising algorithms. The PSNR is defined as 
101ogiQ(;^^=) (dB), where MAX is the maximum pixel 
value of the original image patch and MSE is the mean 
square reconstruction error defined as . Typically in 

image noise reduction applications, PSNR values of 30 dB 
and higher are desired ma, m, a. 

In all the experiments, a batch of ten noisy patches are 
used as the input and the norm of the noise is set to 0.3 times 
the norm of the input vector (PSNR=2L14). We observe that 
RankMap can achieve the same PSNR orders of magnitude 
faster than the baseline implementation. For instance, if our 
desired PSNR is SO.OdB, running FISTA on the decomposed 
data takes 13.9s (I = 240) and 162ss (I = 1000), while it takes 
1050s for the baseline. However, as expected, the baseline (A) 
reaches higher PSNRs in comparison with those achieved from 
running FISTA on the decomposed data. Thus RankMap can 
be used to tradeoff learning accuracy for speed. 

D. Power method 

We also evaluate our framework on power method for three 
datasets: Salinas, VideoDict, and Light Field (i) (see Section 
for more discussion of the power method). The matrix- 
based model is used and the experiments are run on 64 
cores on an IBMiDataplex cluster. We run CSSD with various 
decomposition errors {5d) that belong to the following set: 


TABLE I: Runtime to reach to a specific output PSNR. We 
apply FISTA to solve the denoising problem on A as well 
as two decompositions of A that are derived by setting I 
to 240 and 1000 (in Algorithm 1). The lower dimensional 
decomposition (I = 240) reaches to up to 30 dB output PSNR 
much faster, due to its lower memory footprint and computing 
requirements. Similarly, I = 1000 reaches to up to 35 dB 
PSNR much faster than A but cannot reach to 40 dB PSNR. 


PSNR (dB) 

II 

to 

o 

1 = 1000 

baseline (A) 

25 

4.2 

72 

487 

30 

13.9 

162 

1050 

35 

- 

356 

2051 

40 

- 

- 

3171 


{0.4, 0.2,0.1, 0.05, 0.001} and run the power method on each 
of the decomposition results. Figure [^shows the sparsity of V 
as we vary the error. As expected, for larger error tolerances, a 
sparser decomposition is achieved. Figure]^ shows the impact 
of the decomposition error on the accuracy of the results of the 
power method. Here, the learning error (Sl) is defined to be the 
normalized accumulated error of the first 100 eigenvalues. By 
lowering the decomposition error, we can observe significant 
improvements in the accuracy of the power method. Finally, 
Figure[^shows the corresponding normalized runtimes to find 
the first 100 eigenvalues. Our results suggest that significant 
speedups are achieved in comparison with the baseline. 

E. Graph- vi. matrix-based models 

We compare the performance of RankMap’s vertex and 
matrix-based models for various synthetic decomposed data. 
The purpose of these evaluations is to determine the advan¬ 
tages of each of the model, with respect to the structure of 
the data. In all the experiments, the iterative update in ([^ is 
applied on a random input vector y. The experiments are done 
on an IBM iDataPlex computing cluster. In all the figures, 
the runtime results for the dense matrix-based implementation 
(i.e., regular deployment of the decomposed matrices without 
using CCS format) are provided to demonstrate the efficiency 
achieved by exploiting sparsity in V through sparse matrix- 
based and graph-based models. For the former model, we 
report analysis based on our C-n- MPI implementation and for 
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(a) 




(b) (c) 


Fig. 8; Power method results for three datasets: Salinas, VideoDict, and Light Field (i). In (a), we show the effect of varying 
the decomposition error {Sd) on the sparsity of the factor V. Reported values are the number of non-zeros in V normalized 
to the number of non-zeros in A. In (b), we show the learning error (6^) and in (c) the relative speedup of power method to 
find the first 100 eigenvalues. Relative values are runtime of power method using the decomposed factors derived from CSSD 
normalized to the corresponding runtime using A. 


the latter model we report analysis based on our modification 
of GraphLab engine to implement RankMap. 

The performance of RankMap for different (block-diagonal) 
V’s, with fixed number of non-zeros (set to lOOM), is shown 
in Figure As I increases, the density-level of V decreases. 
The graph-based model’s performance is more consistent as 

1 increases. However, the matrix-based model’s performance 
degrades for larger Fs. This observation can also be explained 
due to the fact that the communication overhead of the matrix- 
based model is more affected by larger Z’s. 

Figure 9b shows the performance for a fixed I — 500 on 
block-diagonal matrices V, for varying densities of V. As 
density increases, the performance decreases in both models. 
However, the performance degradation in graph-based model 
is worse due to the overhead of representing a large number of 
edges. Lastly, Figure shows the scaling performance of the 
models for various number of processors. When the number 
of processors is less than 12, the computations are done on a 
single node. Thus the reverse scaling behavior while increasing 
the number of processors from 8 to 16 is due to the high over¬ 
head of the inter-node communication cost. For comparison 
purposes, we report the scaling of the baseline approach that 
operates on the original dense m = Ik hy n = lOM dataset, 
instead of its decomposed form. It can be seen that as the 
number of processors increases, the performance gap between 
different methods shrink. However, even with a large number 
of processors (> 100), the decomposed models perform up to 

2 orders of magnitude better than the baseline. 

These experiments provide insight into the use-case of each 
model. Depending on the structure of the decomposed data and 
the specifications of the platform, an appropriate model should 
be selected. A more systematic domain-specific approach for 
model selection is the subject of future work. 


F. Memory Analysis 

Table compares the required memory for storing matrices 
V and D. The memory usage of the original matrix A is 
also provided. We also provided the memory savings for the 
case in which D is formed in the same fashion as CSSD 
but V is computed via least-squares, as opposed to OMR 


RankMap results in up to 77.8 x (memory usage) improvement 
over A^A and 8.6 x improvement over the adaptive norm-2 
projection based decomposition. The approximation error for 
both decomposition methods is set to 5d = 0.1. 


Memory size 
(MB) 

Original 

data 

Least-squares 

RankMap 

Salina 

87.9 

86.9 

38.2 

VideoDict 

1411.2 

835.0 

279.2 

Light Field (ii) 

14796.8 

1634.8 

190.1 


TABLE II; Memory analysis. RankMap achieves significant 
improvement via its adaptive and sparsity-inducing approach. 


G. Comparison with Spark 

We have already shown the results based on our implemen¬ 


tation of RankMap on GraphLab in Figures 9a 9b and 9c 


Now we provide runtime comparisons between RankMap and 
a Apache Spark-based implementation of the power method 
on the baseline A versus on the decomposed data DV. Recall 
that the core iterative update function used in power method 
is provided in Q. We report the average runtime per iteration 
with Spark and our implementations on the same hardware. 

Figure [T^ shows the average runtime per iteration on a 
cluster of 8 nodes, 8 core per processor for Salinas, VideoDict, 
and Light Field (ii) datasets. As expected, our carefully 
tailored implementation of RankMap based on C-H- MPI, 
performed significantly better than Spark, by more than 2 
orders of magnitude in some cases. This gap in performance 
is in part due to our particular implementation of RankMap, 
which carefully partitions the data such that the computation 
per core is balanced and the communication is reduced (see 
Section 6.3.2 for performance bounds). In addition. Spark 
provides fault-tolerance which causes extra overhead due to 
data replications. 


VIII. Discussion 

This paper introduced RankMap, a novel distributed frame¬ 
work for applying a host of iterative learning algorithms 
on large-scale dense and structured datasets. Our framework 
leverages low-dimensional structure in datasets in order to 
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Fig. 9: Comparison of matrix- and graph-based computational models on synthetic block diagonal data. We compare the (a) 
runtime vs. size of factorization, (b) relative density of V, and (c) number of processors. 



Fig. 10: The average runtime per power method iteration for 
RankMap and a Spark implementation of the baseline method. 


quickly map a large dataset with dense dependencies into 
lower dimensional components with sparse representations. 
We introduce two computational models, a matrix- and graph- 
based model, that can be used to execute distributed learning 
algorithms. Our framework provides an efficient partitioning 
of the computational flow that guarantees load balancing and 
significantly lowers communication overhead. We apply our 
matrix- and graph-based models to numerous real-world and 
synthetic datasets and demonstrate significant improvements 
in the runtime and memory footprint. 

There is an unavoidable cost associated with factorizing the 
data. For extremely large datasets however, this initial cost can 
pay off a lot. The performance gain achieved on the subsequent 
computations justifies the one-time decomposition overhead. 
For example, we decompose Light Field (ii) dataset (Section 
|V11| | on a cluster of 4 nodes (each with 12 cores) on IBM 
iDataPlex. For I = 240, the decomposition is completed in 
less than 15 minutes. For 10 sample patches (each of length 
18fc) the overall reconstruction time is reduced from more than 
1000s to below 20s. Thus, the offline decomposition overhead 
can be justified once considering that there are thousands of 
patches in a single light field. Moreover, the same dictionary 
can be used to reconstruct other light field datasets. 

There are a number of existing column sampling-based 
methods that aim to improve the performance of specific learn¬ 
ing objectives, such as least-squares jJTl . ^ 2 -niiiiimization 
with square root £i penalty gOl . and SVM GO). RankMap 
is unique in that uses column sampling to improve the perfor¬ 
mance of a broad class of ML algorithms that operates on the 
Gram matrix. Moreover, RankMap relies of the sparsity of the 
decomposition for further improvement in runtime, energy and 
memory usage. Finally, to the best of our knowledge RankMap 


is the first end-to-end framework that is equipped with open- 
source supported APIs 0. 

Sparse matrix factorization approaches such as SPCA and 
KSVD have objectives similar to CSSD, however, their com¬ 
plexity make them difficult to apply to massive datasets. As 
we sample the dataset instead of learning a factorization of the 
data, our proposed decomposition is faster and scalable. Whilst 
our sampling-based approach is effective, the decomposition 
phase in RankMap (see Figure can be readily replaced by 
other sparse decomposition approaches. Tradeoffs between the 
time to compute a factorization (via learning or sampling) and 
how sparse we can make the decomposition are likely to exist. 
Although outside of the scope of this current work, it would 
be interesting to study the utility of learning in terms of its 
later computational benefit. 

Our graph-based and matrix-based computational models 
provide advantages in different data regimes. Thus, it is natural 
to ask which model to select for data processing. Both models 
follow the same computational flow and operate only on 
the non-zeros. In practice, we observe that the matrix-based 
approach is faster: this is especially true when we exploit 
sparsity in the decomposition with a sparse matrix-based 
approach. This is likely due to the fact that the graph-based 
model requires extra overhead to store and operate on the 
vertices and edges. The main advantage of the graph-based 
model is in its reduced communication cost (Section |VI-C2| l. 
When V is completely block diagonal the communication 
becomes almost independent of the number of computing 
nodes ric and is only proportional to 21. In contrast, the 
communication cost in the matrix-based model is always 
proportional to 2lnc. This difference may result in a better 
overall performance of the graph-based approach, especially 
for larger I and ric values. In general, the performance of each 
model is highly dependent on the specifications of the available 
computing nodes including the communication bandwidth and 
computation power (e.g., maximum floating point operations 
per second). Our evaluations in Section [VII-E provide further 
insight into the differences of the two models. 

Throughout our experiments, we used FISTA, an acceler¬ 
ated gradient descent method, as an optimizer. Our computa¬ 
tion/communication and memory minimizing framework can 
also be applied to other optimization methods such as Stochas¬ 
tic Gradient Descent (SGD) and Stochastic Coordinate 
Descent (SCD) Both SGD and SCD operate on the entire 
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mxn dataset, however, each iterative update is performed on a 
subset of rows (as in SGD) or along the columns (in SCD). For 
this reason, the convergence of stochastic method is slower. 
SGD can be integrated within RankMap by breaking the m 
rows of matrix A into batches, and performing RankMap’s 
decomposition on each batch. SCD can also be applied to the 
factorized dataset DV instead of A. In general, RankMap is 
not limited to a particular optimizer, it is beneficial whenever 
there is a need to store/ and or iteratively perform matrix 
multiplication on large datasets. 

In this work, we show how sparse matrix factorization and 
adaptive sampling can be used to speed up iterative optimiza¬ 
tion algorithms on large datasets. We have mainly explored 
its use for computational gains, however, recent theoretical 
results have shown that subsampling data can also be beneficial 
for learning HI. RankMap provides a new computational 
framework from which we can begin to test the ideas of 
efficiency, both in terms of quality of learning and computing 
performance, through randomization and subsampling. 
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